
Add asynchronous concurrent execution #3687

Open

wants to merge 3 commits into base: docs/develop
Conversation

matyas-streamhpc

No description provided.

@matyas-streamhpc matyas-streamhpc self-assigned this Nov 25, 2024
@neon60 neon60 force-pushed the async-doc branch 2 times, most recently from 1484d67 to f81588d Compare December 2, 2024 08:46
@neon60 neon60 marked this pull request as ready for review December 2, 2024 08:53
WIP
@neon60 neon60 force-pushed the async-doc branch 3 times, most recently from a8fd499 to fd5af51 Compare December 6, 2024 18:10

@randyh62 randyh62 left a comment


Left comments. Looks good overall.

Asynchronous concurrent execution
*******************************************************************************

Asynchronous concurrent execution important for efficient parallelism and


Suggested change
Asynchronous concurrent execution important for efficient parallelism and
Asynchronous concurrent execution is important for efficient parallelism and

Asynchronous concurrent execution important for efficient parallelism and
resource utilization, with techniques such as overlapping computation and data
transfer, managing concurrent kernel execution with streams on single or
multiple devices or using HIP graphs.


Suggested change
multiple devices or using HIP graphs.
multiple devices, or using HIP graphs.

data allocation/freeing all happen in the context of device streams.

Streams are FIFO buffers of commands to execute in order on a given device.
Commands which enqueue tasks on a stream all return promptly and the command is


Suggested change
Commands which enqueue tasks on a stream all return promptly and the command is
Commands which enqueue tasks on a stream all return promptly and the task is


Streams are FIFO buffers of commands to execute in order on a given device.
Commands which enqueue tasks on a stream all return promptly and the command is
executed asynchronously. Multiple streams may point to the same device and


Suggested change
executed asynchronously. Multiple streams may point to the same device and
executed asynchronously. Multiple streams can point to the same device and

Commands which enqueue tasks on a stream all return promptly and the command is
executed asynchronously. Multiple streams may point to the same device and
those streams may be fed from multiple concurrent host-side threads. Execution
on multiple streams may be concurrent but isn't required to be.


Suggested change
on multiple streams may be concurrent but isn't required to be.
on multiple streams might be concurrent but isn't required to be.
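The stream semantics described in this hunk can be illustrated with a minimal sketch. This is not part of the PR's text; it assumes a HIP-capable device, and the `scale` kernel is a placeholder invented for the example.

```cpp
#include <hip/hip_runtime.h>

// Placeholder kernel: scales each element by a factor.
__global__ void scale(float* data, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    constexpr size_t n = 1 << 20;
    float* d_data;
    hipMalloc(&d_data, n * sizeof(float));
    hipMemset(d_data, 0, n * sizeof(float));

    // Two streams pointing to the same device: commands within each
    // stream execute in FIFO order, but the two streams may overlap.
    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    // Both launches return promptly to the host; the kernels
    // themselves execute asynchronously on the device.
    hipLaunchKernelGGL(scale, dim3(n / 512), dim3(256), 0, s0,
                       d_data, 2.0f, n / 2);
    hipLaunchKernelGGL(scale, dim3(n / 512), dim3(256), 0, s1,
                       d_data + n / 2, 2.0f, n / 2);

    // The host blocks here until each stream has drained.
    hipStreamSynchronize(s0);
    hipStreamSynchronize(s1);

    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
    hipFree(d_data);
    return 0;
}
```

Note that nothing forces the two streams to run concurrently; the sketch only permits it, matching the "might be concurrent but isn't required to be" wording.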

contention for shared resources. This is because multiple kernels may attempt
to access the same GPU resources simultaneously, leading to delays.

Asynchronous kernel execution is beneficial only under specific conditions It


Suggested change
Asynchronous kernel execution is beneficial only under specific conditions It
Asynchronous kernel execution is beneficial only under specific conditions. It

or from the GPU concurrently with kernel execution. Applications can query this
capability by checking the ``asyncEngineCount`` device property. Devices with
an ``asyncEngineCount`` greater than zero support concurrent data transfers.
Additionally, if host memory is involved in the copy, it should be page-locked


Is there a reference we can provide such as Memory Management or something?
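The capability check and the page-locked host allocation described in this hunk could be sketched as follows. This is an illustration added for review context, not part of the PR; it assumes device 0 and a HIP-capable system.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);

    // Devices reporting a nonzero asyncEngineCount can overlap
    // data transfers with kernel execution.
    printf("asyncEngineCount: %d\n", prop.asyncEngineCount);

    if (prop.asyncEngineCount > 0) {
        // Host memory involved in an async copy should be page-locked;
        // hipHostMalloc allocates pinned host memory.
        float* h_buf;
        hipHostMalloc(reinterpret_cast<void**>(&h_buf),
                      1024 * sizeof(float), hipHostMallocDefault);

        // hipMemcpyAsync from/to h_buf can now overlap with kernels.

        hipHostFree(h_buf);
    }
    return 0;
}
```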


It is also possible to perform intra-device copies simultaneously with kernel
execution on devices that support the ``concurrentKernels`` device property
and/or with copies to or from the device (for devices that support the


Suggested change
and/or with copies to or from the device (for devices that support the
and/or with copies to or from the device (for devices that support the

Are copies to or from the device intra-device copies?
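For context, the overlap described in this hunk might look like the following sketch. It is not part of the PR; the `work` kernel is a placeholder, and actual concurrency depends on the device reporting `concurrentKernels`.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void work(float* buf) { buf[threadIdx.x] *= 2.0f; }  // placeholder

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    printf("concurrentKernels: %d\n", prop.concurrentKernels);

    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, 256 * sizeof(float));
    hipMalloc(&d_b, 256 * sizeof(float));
    hipMalloc(&d_c, 256 * sizeof(float));

    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    // On devices that support concurrentKernels, the intra-device
    // (device-to-device) copy on s1 may proceed while the kernel
    // on s0 is still running.
    hipLaunchKernelGGL(work, dim3(1), dim3(256), 0, s0, d_a);
    hipMemcpyAsync(d_c, d_b, 256 * sizeof(float),
                   hipMemcpyDeviceToDevice, s1);

    hipDeviceSynchronize();
    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
    hipFree(d_a); hipFree(d_b); hipFree(d_c);
    return 0;
}
```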

called, control is not returned to the host thread before the device has
completed the requested task. The behavior of the host thread—whether to yield,
block, or spin—can be specified using :cpp:func:`hipSetDeviceFlags` with
specific flags. Understanding when to use synchronous calls is important for


Suggested change
specific flags. Understanding when to use synchronous calls is important for
appropriate flags. Understanding when to use synchronous calls is important for
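The flag-controlled host behavior mentioned in this hunk could be sketched like this (an illustration for review context, not part of the PR; it assumes a HIP-capable device):

```cpp
#include <hip/hip_runtime.h>

int main() {
    // Choose how the host thread waits on synchronous calls.
    // hipDeviceScheduleYield: yield the host thread while waiting;
    // alternatives include hipDeviceScheduleSpin and
    // hipDeviceScheduleBlockingSync.
    hipSetDeviceFlags(hipDeviceScheduleYield);

    // ... enqueue asynchronous work here ...

    // A synchronous call: control does not return to the host thread
    // until the device has completed all requested tasks, and the
    // flag above governs whether the host yields, spins, or blocks.
    hipDeviceSynchronize();
    return 0;
}
```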

By creating an event with :cpp:func:`hipEventCreate` and recording it with
:cpp:func:`hipEventRecord`, developers can synchronize operations across
streams, ensuring correct task execution order. :cpp:func:`hipEventSynchronize`
allows waiting for an event to complete before proceeding with the next


Suggested change
allows waiting for an event to complete before proceeding with the next
lets the application wait for an event to complete before proceeding with the next
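The cross-stream event synchronization described in this hunk might be sketched as follows. This is added for review context, not part of the PR; the `produce`/`consume` kernels are placeholders.

```cpp
#include <hip/hip_runtime.h>

// Placeholder kernels standing in for real work.
__global__ void produce(float* buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void consume(float* buf) { buf[threadIdx.x] += 1.0f; }

int main() {
    float* d_buf;
    hipMalloc(&d_buf, 256 * sizeof(float));

    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    hipEvent_t done;
    hipEventCreate(&done);

    hipLaunchKernelGGL(produce, dim3(1), dim3(256), 0, s0, d_buf);
    hipEventRecord(done, s0);      // mark completion of produce on s0

    // Make s1 wait for the event before running consume, establishing
    // a cross-stream ordering without blocking the host thread.
    hipStreamWaitEvent(s1, done, 0);
    hipLaunchKernelGGL(consume, dim3(1), dim3(256), 0, s1, d_buf);

    hipEventSynchronize(done);     // host waits for the event itself
    hipDeviceSynchronize();

    hipEventDestroy(done);
    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
    hipFree(d_buf);
    return 0;
}
```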

sequences of kernels and memory operations as a single graph, they simplify
complex workflows and enhance performance, particularly for applications with
intricate dependencies and multiple execution stages.
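The graph workflow described in this hunk could be sketched via stream capture. This example is added for review context only, not part of the PR; the `step` kernel is a placeholder.

```cpp
#include <hip/hip_runtime.h>

__global__ void step(float* buf) { buf[threadIdx.x] += 1.0f; }  // placeholder

int main() {
    float* d_buf;
    hipMalloc(&d_buf, 256 * sizeof(float));

    hipStream_t s;
    hipStreamCreate(&s);

    // Record a sequence of kernels into a graph via stream capture.
    hipGraph_t graph;
    hipStreamBeginCapture(s, hipStreamCaptureModeGlobal);
    hipLaunchKernelGGL(step, dim3(1), dim3(256), 0, s, d_buf);
    hipLaunchKernelGGL(step, dim3(1), dim3(256), 0, s, d_buf);
    hipStreamEndCapture(s, &graph);

    // Instantiate once, then relaunch the whole sequence cheaply.
    hipGraphExec_t exec;
    hipGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int i = 0; i < 10; ++i)
        hipGraphLaunch(exec, s);
    hipStreamSynchronize(s);

    hipGraphExecDestroy(exec);
    hipGraphDestroy(graph);
    hipStreamDestroy(s);
    hipFree(d_buf);
    return 0;
}
```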
